arXiv:1508.02502vl [stat.ME] 11 Aug 2015 


Projection predictive variable selection using 

Stan+R 


Juho Piironen* and Aki Vehtari 
August 12, 2015 


Abstract 

This document is additional material to our previous study comparing several 
strategies for variable subset selection (Piironen and Vehtari, 2015). Our recom¬ 
mended approach was to fit the full model with all the candidate variables and best 
possible prior information, and perform the variable selection using the projection 
predictive framework. Here we give an example of performing such an analysis, 
using Stan for fitting the model, and R for the variable selection. 


1 Introduction 

Identifying relevant explanatory variables is often of interest in applied statistical anal¬ 
ysis. In this context, the relevance of a variable is usually determined by its predictive 
power on the target variable of interest. The gain from successful variable selection is 
typically improved model interpretability, but depending on the problem, also reduced 
measurement costs and computational savings may be obtained. 

In our recect study (Piironen and Vehtari, 2015), we discussed and compared several 
strategies for performing variable selection in practical problems and concluded that 
often best results are obtained by the following strategy: fit the full model with all 
the candidate variables and best possible prior information, and perform the variable 
selection using the projection predictive method (Goutis and Robert, 1998; Dupuis and 
Robert, 2003). Our comparison showed that this approach generally outperforms the 
selection of the most probable variables or variable combination, methods which are 
often used for performing Bayesian variable selection. 

This document serves as additional material to our study, the purpose being to pro¬ 
vide an example of how to carry out the model fitting with Stan and the subsequent 
variable selection using R. Sampling the full model under the popular spike-and-slab 
prior (Mitchell and Beauchamp, 1988) is infeasible with Stan, but the prior assump¬ 
tions about the sparsity can be conveniently formulated using the hierarchical shrink¬ 
age priors, such as the horseshoe (Carvalho et ah, 2009, 2010), which allow convenient 
computation using Stan. All the relevant codes are provided in the Appendix. 

The document is organized as follows. Section 2 reviews the hierarchical shrinkage in 
the context of linear regression using half-Student-f priors for the weight scales, and 
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discusses the horseshoe prior as a special case. Section 3 shortly reviews the projection 
predictive variable selection and discusses the computations in the example case of 
a linear Gaussian model. Finally, in Section 4 we provide an illustrative numerical 
example. 


2 Hierarchical shrinkage 


Consider the single output linear Gaussian regression model with several input vari¬ 
ables, given by 


= w'^x, 

yi = fi+£i, £~N(0,(t^), 


( 1 ) 


where x is the m-dimensional vector of inputs, w contains the corresponding weights 
and cr^ is the noise variance. A hierarchical shrinkage (HS) prior for the regression 
weights w = (wi,..., Wm) can be obtained as 


Wi I A*,r ~ N(0, AjT^) 
A, ~f+(0,l). 


( 2 ) 


where denotes the half-Student-f prior with ly degrees of freedom. We will refer 
to (2) by the acronym HS-f^. The horsehoe prior (Carvalho et ah, 2009, 2010) is ob¬ 
tained by setting ly = 1, that is, by introducing half-Cauchy priors for the local scale 
parameters A^. The horseshoe prior has been shown to possess desirable theoretical 
properties and good performance in practice (Carvalho et ah, 2009, 2010; Datta and 
Ghosh, 2013; van der Pas et ah, 2014). Intuitively, we expect the local variance pa¬ 
rameters Af to be large for those inputs that have high relevance, and small for those 
with negligible relevance, while the global variance term adjusts the overall sparsity 
level. The shrinkage coefficient Ki = 1/(1 -f Af) describes the amount of shrinkage 
for the weight Wi, so that Ki = 0 means no shrinkage (Af large) and Ki = 1 complete 
shrinkage (Af small). 

Figure 1 shows the priors on Ki implied by different choices of ly. In all the cases, the 
prior encourages shrinkage (Ki ss 1) which is due to the fact that the density of the 
half-f prior evaluates to a positive constant near the origin, and thus contains mass near 
Xi = 0. Moreover, the long tails of the Cauchy distribution (ly = 1) allow some of the 
Xi to be very large, leaving those weights essentially unshrunk (Ki = 0). The reason 
why the horseshoe yields results closely similar to the spike-and-slab prior (Mitchell 
and Beauchamp, 1988) is due to this dual nature (complete shrinkage or no-shrinkage). 
When the value of ly is increased, the prior still allows strong shrinkage but leaves none 
of the weights completely unshrunk. 

Recently, an extended version of the horseshoe prior, called the horseshoe-r was pro¬ 
posed by Bhadra et al. (2015). Consider adding another level of local scale parameters 
to (2) as 

w* I Ai,r ~ N(0,A2?7 /t^) 

A*-f^(0,l), (3) 
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Figure 1: The shrinkage profile of the prior (2), that is, the prior distribution of the 
shrinkage coefficient Ki = 1/(1 + Af), for different values of i^. The case 1 ^ = 1 
corresponds to the horseshoe. Ki = 0 means no shrinkage and Ki = 1 complete 
shrinkage. 


We shall refer to the prior (3) as HS-f,y+. The horseshoe+ is then obtained by using 
half-Cauchy distributions, that is, setting v = 1 (note that the above notation differs 
slightly from the original paper). The authors argue that the horseshoe-t obtains an 
improved behaviour compared to the original horseshoe in ultra-sparse problems, both 
theoretically and empirically. 

The hierarchical priors (2) and (3) are straightforward to implement and use in Stan. 
However, in practice we observe that for the particular case of ly = 1 (horseshoe and 
horseshoe-t) NUTS produces a lot of divergent transitions even after the warm-up pe¬ 
riod'. Experimentally we find that the number of divergent transitions can be reduced 
by larger value of k, that is, by shortening the tails of the priors for the local scale terms 
(see Section 4.1). As depicted in Figure 1, this affects the shrinkage properties of the 
prior but as will be demonstrated in Section 4.1, the effect regarding the predictions 
may be negligible. See also the study by Ghosh et al. (2015) discussing the problems 
of the half-Cauchy prior for the logistic regression when the data is separable. 


3 Projection predictive variable selection 

This section shortly reviews the idea of the projection predictive method for variable 
selection. Section 3.1 presents the general idea and Section 3.2 discusses the linear 
Gaussian regression model as a special case. Our discussion is concise, for more in¬ 
formation about the assessment of the method, see our previous paper (Piironen and 
Vehtari, 2015), and for more on the theoretical considerations and related concepts, see 
the review by Vehtari and Ojanen (2012). 

*See this thread for outlining the problem https : //groups . google . com/d/msg/stan-dev/ 
ptlNNytNVUI/4PH09ekefBYJ 
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3.1 General framework 


The idea in the projection approach of Goutis and Robert (1998) and Dupuis and Robert 
(2003) is to simplify the full model M* by projecting the information in the posterior 
onto the submodels so that the predictions change as little as possible. Given the pa¬ 
rameters of the full model 6*, the projected parameters 9^ in the parameter space of 
submodel M± are defined as 

1 " 

9^ = argmm-'^KL{p{y\x„9*,M^)\\p{y\x„9,M±)). (4) 

9 n 

The discrepancy between the full model and the submodel is then defined to be the 
expectation of this divergence over the posterior of the full model 

1 " 

i—1 

The posterior expectation in (5) is in general not available analytically. Dupuis and 
Robert (2003) proposed calculating the discrepancy by drawing samples from 

the posterior of the reference model, calculating the projected parameters 
individually according to (4), and then approximating (5) as 

6{M4M^) « I \\p{y \ x,,9j,M4) . (6) 

i—1 s=l 

This approximation makes the computations feasible as it requires only ability to draw 
samples from the full model and a routine for solving the optimization problem (4). In 
general case the optimization problem can be solved numerically (using e.g. Newton’s 
method) but for the simplest models such as the linear Gaussian case, the minimization 
can be carried out analytically (see the next section). 

In model selection, we seek for submodels M± which have small discrepancy from the 
full model (6). The final assessment of how much the full model can be simplified, can 
be performed using cross-validation (see Section 4.2). 

3.2 Example: linear Gaussian model 

Consider the linear Gaussian model (1). For easier notation, we set the first term of 
w to be the intercept wq and the first input to be fixed xq = 1. For this model, the 
projected parameters (4) can be calculated analytically. Given a sample (w, from 
the posterior of the full model, the projected parameters are given by (see Appendix A) 

wx = (Xx'^Xx)"^Xx'^ Xw (7) 

CTx = cr^-I-—(Xw — Xxwx)'''(Xw — Xxwx), (8) 

n 

and the associated KL-divergence (for this particular sample) is 

fi(w,cr2) = ;^logE- 

2 ( 7 ^ 
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Here X = (xj,... ,x^) denotes the nx m predictor matrix of the full model, and Xj^ 
the contains those columns of X that correspond to the submodel we are projecting 
onto. The projection equations (7) and (8) have a nice interpretation. The projected 
weights are determined by the maximum likelihood solution with the observations y 
replaced by the fit of the full model f = Xw. The projected noise variance is the noise 
level of the full model plus the mismatch between the fits of the full and the projected 
model. 

As discussed in the previous section, we draw a sample {w^, from the posterior 

of the full model, compute the projected parameters and associated KL-divergences 
according to Equations (7), (8) and (9), and then estimate the discrepancy between the 
full and submodel as 

1 ^ 

<j(M,||M_L) = -^d(w„a2). (10) 

This procedure will produce a parsimonious model with exactly zero weights for the 
variables that are left-out. 


4 Numerical example: Crime dataset 

This section presents an example of fitting a regression model under the HS-f^ and HS- 
ti^+ priors using Stan (Section 4.1), and demonstrates how to perform the subsequent 
variable selection using R (Section 4.2). All the codes can be found online^, and the 
most relevant parts are also included in the Appendix. We use the Crime dataset^ used 
in our earlier study. After removing the features and instances with missing values, the 
data consist of 1992 instances with d = 102 predictor variables. We normalized all the 
input variables to have zero mean and unit variance, and log normalized the original 
target variable (total number of crimes per population) to get a real valued and more 
Gaussian output. For illustrational purposes, we split the data randomly into two so that 
n = 1000 points are used for model training and variable selection, and the remaining 
n = 992 points are used for testing. 


4.1 Model fitting 

We fit the linear Gaussian regression model (1) with all the p = 102 predictors under 
the HS-fi/ and HS-fj/H- priors with v = 1 and v = i (see Section 2). Note that we 
use these hierarchical priors only for the weights of the nonconstant inputs. For the 
intercept term we use a weakly informative prior 

t«o~N(0,52). 

For the global scale parameter r and for the noise variance cr^, we use the following 
uninformative priors 

r~C+(0,l) 

(X 1. 

^https://github.com/jtpi/rstan-varsel 

^https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime 


5 



Weights 


Predictions 




HS-ti 


HS-ti 




Figure 2: Left column: Posterior means for the weights with different priors, HS- 
ti vs. HS-fa (top) and HS-fi+ vs. HS-f 3 + (bottom). HS-fi and HS-fi+ correspond to 
horseshoe and horsehoe+ (see Section 2). Right column: the same but for the predictive 
means on the test set. 


The Stan codes for fitting the models are given in Appendix B. 

Using the training data, we sample 4 chains each having 1000 samples, and from each 
chain we remove the first half as a warmup. Figure 2 shows the effect of the degrees 
of freedom ly in (2) and (3) on the posterior means of the weights and predictions on 
the test data. In both cases, two of the weights are shrunk more heavily towards zero 
under z/ = 3, while the predictions remain practically the same. The percentage of 
divergent transitions under the posterior simulations were 3.4% and 5.2% for HS-fi 
and HS-fi+, whereas only 0.0% and 0.1% for HS-fa and HS-fa+- Given the more 
robust sampling and negligible effect on the predictive performance, we proceed to 
the variable selection using = 3. However, we emphasize the tentative nature of 
this experiment and do not recommend this choice to be used uncritically. Thorough 
understanding of the effects of z/ needs more research, but we do not focus on it in this 
report. 


4.2 Variable selection 

After fitting the model (previous section) with all the variables we proceed to the vari¬ 
able selection. Here we use the HS-fa prior (Section 2) for the full model and the pro¬ 
jection predictive variable selection strategy (Section 3.2). As a search heuristic, we 
use forward searching, that is, starting from the empty model, we add variables one at a 
time, each time choosing the variable that decreases the KL-divergence (10) the most. 
As discussed in our study (Piironen and Vehtari, 2015), the effect of number of chosen 
variables on the predictive ability can be assessed reliably using cross-validation. In 
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Figure 3: Difference in the mean log predictive density (MLPD) and mean squared 
error (MSE) between the submodel and the full model as a function of number of 
chosen variables up to 50 variables. Black is the average over the K = 10 cross- 
validated searches, grey bars denoting the 95% credible interval, and green is the test 
performance when the search is done using all the training data. 


other words, we repeat the model htting and selection K times each time leaving njK 
points out for validation, and evaluate the performance of the full model and the found 
submodels using these left-out data. The relevant R codes are provided in Appendix C. 

Figure 3 shows the difference in the mean log predictive density and mean squared error 
between the submodel and the full model as a function of number of added variables 
up to 50 variables. The black line is the average over the AT = 10 cross-validation folds 
and the green line shows the result when the htting and searching is performed using 
all the training data and the performance is evaluated on the test data. As expected, 
there is a good correspondence between the cross-validated and test performance. For 
this dataset, most of the predictive ability of the full model is captured with about 5 
variables, and 20 variables are enough for getting predictions indistinguishable from 
the full model for all practical purposes. 
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A Projection for linear Gaussian model 


Let denote the fit of the submodel, where wj^ is the projected weight 

vector and the input vector corresponding to the submodel on which we are pro¬ 
jecting. Moreover, let denote the projected noise variance. The cost function in (4) 
can be written as 


iy:KL(N(/.,,“) ||N(/.^,4)) 






= S + if’ - - ') 

= ^ ^ ^ <f- -1) • 

= T floS ^ ^ + -(Xw - Xj_w_l)^(Xw - Xj_w_l)^ 

2 \ \ n J 



(A.l) 


where f = (/i,..., /„), fj^ = {fi ,..., f^), and X and Xj^ denote the predictor 
matrices of the full and submodel, respectively. Setting the gradient of (A.l) with 
respect to wj^ to zero, we obtain 

wx = (Xx'^Xx)"^Xx'^ Xw. 
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(A.2) 



Accordingly, minimizing (A.l) with respect to cr^ gives us 

(T^ = H—(Xw — Xj_wx)'''(Xw — Xj^wx). (A.3) 

n 

Plugging the expressions (A.2) and (A.3) into (A.l) gives us the divergence introduced 
by the projection of these particular parameters (w, cr^) 

n 

fi(w,cr2) = -^KL(N(/„cr2) ||N(/^,cri)) 

= ^log^. (A.4) 


B Stan codes 

Linear regression with Gaussian noise and prior (2) on the weights. The case 

ly = 1 corresponds to the horseshoe. 

/* . Stan */ 

functions { 

// square root of a vector (elementwise) 
vector sqrt_vec(vector x) { 
vector[dims(x)[1]] res; 

for (m in 1:dims(x)[1]){ 
res[m] <- sqrt(x[m]); 

} 

return res; 



data { 

int<lower=0> n; // number of observations 
int<lower=0> d; // number of predictors 
vector[n] y; // outputs 

matrix[n,d] X; // inputs 

real<lower=l> nu; // degrees of freedom for the half t-priors 


parameters { 

// intercept and noise std 
real wO; 

real<lower=0> sigma; 

// auxiliary variables for the variance parameters 
vector[d] z; 

real<lower=0> rl_global; 
real<lower=0> r2_global; 
vector<lower=0>[d] rl_local; 
vector<lower=0>[d] r2_local; 


transformed parameters { 
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// global and local variance parameters^ and the input weights 
real<lower=0> tau; 
vector<lower=0>[d] lambda; 
vector[d] w; 

tau <- rl_global * sqrt(r2_global); 
lambda <- rl_local .* sqrt_vec(r2_local); 
w <- z .* lambda*tau; 


model { 

// observation model 
y ~ normal(wO + X*w, sigma); 

// half t-priors for lambdas (nu = 1 corresponds to horseshoe) 

z ~ normal(0, 1); 

rl_local ~ normal (0.0, 1.0); 

r2_local ~ inv_gamma(0.5*nu, 0.5*nu); 

// half cauchy for tau 
rl_global ^ normal(0.0, 1.0); 
r2_global ~ inv_gamma(0.5, 0.5); 

// weakly informative prior for the intercept 
wO ~ normal(0,5); 

// using uniform prior on the noise variance 


The same model as above, but with HS-t^+ prior (3) on the weights. The case v 
corresponds to the horseshoe+. 

/* lg_tplus. Stan */ 
functions { 

// square root of a vector (elementwise) 
vector sqrt_vec(vector x) { 
vector[dims(x)[1]] res; 

for (m in 1:dims(x)[1]){ 
res[m] <- sqrt(x[m]); 

} 

return res; 



data { 

int<lower=0> n; // number of observations 
int<lower=0> d; // number of predictors 
vector[n] y; // outputs 

matrix[n,d] X; // inputs 

real<lower=l> nu; // degrees of freedom for the half t-priors 

} 


parameters { 


// intercept and noise std 
real wO; 

real<1ower=0> sigma; 
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// auxiliary variables for the variance parameters 
vector[d] z; 

real<lower=0> rl_global; 
real<lower=0> r2_global; 
vector<lower=0>[d] rl_local; 
vector<lower=0>[d] r2_local; 
vector<lower=0>[d] rl_localplus; 
vector<lower=0>[d] r2_localplus; 


transformed parameters { 

// global and local variance parameters, and the input weights 

real<lower=0> tau; 

vector<lower=0>[d] lambda; 

vector<lower=0>[d] lambdaplus; 

vector[d] w; 

tau <- rl_global * sqrt(r2_global); 
lambda <- rl_local .* sqrt_vec(r2_local); 
lambdaplus <- rl_localplus sqrt_vec(r2_localplus); 

w <- z .* lambda .* lambdaplus * tau; 


model { 

// observation model 
y ~ normal(wO + X*w, sigma); 

// half t-priors for all the lambdas (nu = 1 corresponds to horseshoe+) 

z ~ normal(0, 1); 

rl_local ~ normal(0.0, 1.0); 

r2_local ~ inv_gamma(0.5*nu, 0.5*nu); 

rl_localplus ~ normal(0.0, 1.0); 

r2_localplus ~ inv_gamma(0.5*nu, 0.5*nu); 

// half cauchy for tau 
rl_global ^ normal(0,0, 1.0); 
r2_global ~ inv_gamma(0.5, 0.5); 

// weakly informative prior for the intercept 
wO ~ normal(0,5); 

// using uniform prior on the noise variance 


C R codes 

Code for performing the projection onto a submodel for linear Gaussian model, given 
a sample from the full model; 

lm_proj <- function(w,sigma2,X,indproj) { 

# assume the intercept term is stacked into w, and x contains 

# a corresponding vector of ones, returns the projected samples 

# and estimated kl-divergence. 

# pick the columns of x that form the projection subspace 
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n <- dim(x) [1] 
xp <- x[,indproj] 

# solve the projection equations 

fit <- X %*% w # fit of the full model 

wp <- solve(t(xp) %*% xp, t(xp) %*% fit) 

sigma2p <- sigma2 + colMeans ( (fit - xp %*% wp)'^2) 

# this is the estimated kl-divergence between the full and projected model 
kl <- mean(0.5*log(sigma2p/sigma2)) 

# reshape wp so that it has same dimensionality as x, and zeros for 

# those variables that are not included in the projected model 
d <- dim(w)[1] 

S <- dim(w) [2] 
wptemp <- matrix(0, d, S) 
wptemp[indproj, ] <- wp 
wp <- wptemp 

return(list(w=wp, sigma2=sigma2p, kl=kl) ) 

} 


Forward variable searching by minimizing the associated KL-divergence in the projec¬ 
tion; 

lm_fprojsel <- function(w, sigma2, x) { 

# forward variable selection using the projection, returns the 

# indices of the variables chosen (in order) and accociated kl-divergences. 

d = dim(x)[2] 

chosen <- 1 # chosen variables^ start from the model with the intercept only 
notchosen <- setdiff(l:d, chosen) 

# start from the model having only the intercept term 
kl <- rep(0,d) 

kl[l] <- lm_proj(w,sigma2,X,1)$kl 

# start adding variables one at a time 
for (k in 2:d) { 

nleft <- length(notchosen) 
val <- rep(0, nleft) 

for (i in linleft) { 

ind <- sort ( c(chosen, notchosen[i]) ) 

proj <- lm_proj (w, sigma2, X, ind) 
val[i] <- proj$kl 

} 


# find the variable that minimizes the kl 
imin <- which.min(val) 
chosen <- c(chosen, notchosen[imin]) 
notchosen <- setdiff(l:d, chosen) 

kl[k] <- val[imin] 

} 

return (list(chosen=chosen, kl=kl)) 
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Example of sampling the full model and performing the variable selection using the 
projection: 

# load the data 

n <- 1000 # number of training points 

data <- as.matrix(read.table('crimedata.CSV', sep=',')) 
dims <- dim(data) 
ntotal <- dims[l] 

nt <- ntotal - n # number of test points 
d <- dims[2]-1 

# training data 

y <- data[l:n, 1] 

X <- data[l:n, 2:(d+l)] 

# test data 

yt <- data[(n+1)intotal, 1] 
xt <- data[(n+1):ntotal, 2:(d+l)] 

# fit the full model 

fit <- Stan("lg_t.Stan", data=list(X=x,y=y,d=d,n=n,nu=3.0), iter=100, chains=l) 
e <- extract(fit) 

w <- rbind(e$w0, t(e$w)) # stack the intercept and weights 

sigma2 <- e$sigma''2 

X <- cbind(rep(1,n), x) # add a vector of ones to the predictor matrix 
xt <- cbind(rep(1,ntest), xt) 

# perform the variable selection 
spath <- lm_fprojsel(w, sigma2, x) 

# compute the predictions on the test set using the projected parameters 
mlpd <- rep(0, d+1) 

mse <- rep(0, d+1) 
for (k in 1:(d+1)) { 

# projected parameters 

submodel <- lm_proj(w, sigma2, x, spath$chosen[1:k]) 

wp <- submodel$w 

sigma2p <- submodel$sigma2 

# mean squared error 

ypred <- rowMeans(xt %*% wp) 
mse[k] <- mean ( (yt-ypred) "'I) 

# mean log predictive density 

pd <- dnorm(yt, xt %*% wp, sqrt(sigma2p)) 
mlpd[k] <- mean(log(rowMeans(pd))) 

} 
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